This is a project where my group tried to model Airbnb prices in Mexico city using multivariate regressions.
listings <- vroom("http://data.insideairbnb.com/mexico/df/mexico-city/2020-06-20/data/listings.csv.gz")
glimpse(listings)
## Rows: 21,824
## Columns: 106
## $ id <dbl> 35797, 56074, 61792, 7...
## $ listing_url <chr> "https://www.airbnb.co...
## $ scrape_id <dbl> 2.02e+13, 2.02e+13, 2....
## $ last_scraped <date> 2020-06-23, 2020-06-2...
## $ name <chr> "Villa Dante", "Great ...
## $ summary <chr> "Dentro de Villa un es...
## $ space <chr> "please go to (URL HID...
## $ description <chr> "Dentro de Villa un es...
## $ experiences_offered <chr> "none", "none", "none"...
## $ neighborhood_overview <chr> "Centro comercial Sant...
## $ notes <chr> "Si te gustan la tipo ...
## $ transit <chr> "Uber es buena opción ...
## $ access <chr> "Jardin muy Amplio.", ...
## $ interaction <chr> "Cualquier duda contác...
## $ house_rules <chr> "Se renta un estudio ...
## $ thumbnail_url <lgl> NA, NA, NA, NA, NA, NA...
## $ medium_url <lgl> NA, NA, NA, NA, NA, NA...
## $ picture_url <chr> "https://a0.muscache.c...
## $ xl_picture_url <lgl> NA, NA, NA, NA, NA, NA...
## $ host_id <dbl> 153786, 265650, 299558...
## $ host_url <chr> "https://www.airbnb.co...
## $ host_name <chr> "Dici", "Maris", "Robe...
## $ host_since <date> 2010-06-28, 2010-10-1...
## $ host_location <chr> "Mexico City, Mexico C...
## $ host_about <chr> "Master in visual arts...
## $ host_response_time <chr> "N/A", "within an hour...
## $ host_response_rate <chr> "N/A", "100%", "100%",...
## $ host_acceptance_rate <chr> "N/A", "91%", "67%", "...
## $ host_is_superhost <lgl> FALSE, TRUE, FALSE, TR...
## $ host_thumbnail_url <chr> "https://a0.muscache.c...
## $ host_picture_url <chr> "https://a0.muscache.c...
## $ host_neighbourhood <chr> NA, "San Rafael", "Con...
## $ host_listings_count <dbl> 2, 2, 1, 4, 3, 3, 4, 2...
## $ host_total_listings_count <dbl> 2, 2, 1, 4, 3, 3, 4, 2...
## $ host_verifications <chr> "['email', 'phone', 'r...
## $ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE...
## $ host_identity_verified <lgl> FALSE, FALSE, FALSE, T...
## $ street <chr> "Mexico City, D.f., Me...
## $ neighbourhood <chr> NA, "San Rafael", "Con...
## $ neighbourhood_cleansed <chr> "Cuajimalpa de Morelos...
## $ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA...
## $ city <chr> "Mexico City", "Mexico...
## $ state <chr> "D.f.", "DF", "Ciudad ...
## $ zipcode <chr> NA, NA, "06140", "0410...
## $ market <chr> "Mexico City", "Mexico...
## $ smart_location <chr> "Mexico City, Mexico",...
## $ country_code <chr> "MX", "MX", "MX", "MX"...
## $ country <chr> "Mexico", "Mexico", "M...
## $ latitude <dbl> 19.4, 19.4, 19.4, 19.4...
## $ longitude <dbl> -99.3, -99.2, -99.2, -...
## $ is_location_exact <lgl> FALSE, TRUE, TRUE, TRU...
## $ property_type <chr> "Villa", "Condominium"...
## $ room_type <chr> "Entire home/apt", "En...
## $ accommodates <dbl> 2, 3, 2, 2, 2, 2, 14, ...
## $ bathrooms <dbl> 1.0, 1.0, 1.0, 1.0, 1....
## $ bedrooms <dbl> 1, 1, 1, 1, 1, 1, 4, 1...
## $ beds <dbl> 1, 2, 1, 1, 1, 1, 10, ...
## $ bed_type <chr> "Futon", "Real Bed", "...
## $ amenities <chr> "{Wifi,Kitchen,\"Free ...
## $ square_feet <dbl> 32292, 646, 161, NA, 1...
## $ price <chr> "$4,500.00", "$843.00"...
## $ weekly_price <chr> NA, "$4,740.00", NA, "...
## $ monthly_price <chr> "$124,995.00", "$15,72...
## $ security_deposit <chr> NA, "$2,279.00", "$11,...
## $ cleaning_fee <chr> NA, "$684.00", "$340.0...
## $ guests_included <dbl> 1, 2, 2, 2, 1, 1, 6, 1...
## $ extra_people <chr> "$0.00", "$342.00", "$...
## $ minimum_nights <dbl> 1, 4, 2, 6, 4, 1, 2, 4...
## $ maximum_nights <dbl> 7, 150, 21, 180, 365, ...
## $ minimum_minimum_nights <dbl> 1, 4, 2, 6, 4, 1, 2, 4...
## $ maximum_minimum_nights <dbl> 1, 4, 2, 6, 4, 1, 2, 4...
## $ minimum_maximum_nights <dbl> 7, 1125, 21, 180, 365,...
## $ maximum_maximum_nights <dbl> 7, 1125, 21, 180, 365,...
## $ minimum_nights_avg_ntm <dbl> 1.0, 4.0, 2.0, 6.0, 4....
## $ maximum_nights_avg_ntm <dbl> 7, 1125, 21, 180, 365,...
## $ calendar_updated <chr> "35 months ago", "4 we...
## $ has_availability <lgl> TRUE, TRUE, TRUE, TRUE...
## $ availability_30 <dbl> 23, 0, 30, 28, 0, 0, 2...
## $ availability_60 <dbl> 53, 0, 60, 58, 19, 0, ...
## $ availability_90 <dbl> 83, 0, 90, 88, 49, 18,...
## $ availability_365 <dbl> 358, 0, 180, 363, 319,...
## $ calendar_last_scraped <date> 2020-06-23, 2020-06-2...
## $ number_of_reviews <dbl> 0, 60, 52, 102, 10, 0,...
## $ number_of_reviews_ltm <dbl> 0, 2, 1, 10, 2, 0, 11,...
## $ first_review <date> NA, 2017-11-18, 2017-...
## $ last_review <date> NA, 2019-07-24, 2019-...
## $ review_scores_rating <dbl> NA, 97, 98, 98, 100, N...
## $ review_scores_accuracy <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_cleanliness <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_checkin <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_communication <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_location <dbl> NA, 10, 10, 10, 10, NA...
## $ review_scores_value <dbl> NA, 10, 10, 10, 10, NA...
## $ requires_license <lgl> FALSE, FALSE, FALSE, F...
## $ license <lgl> NA, NA, NA, NA, NA, NA...
## $ jurisdiction_names <chr> "{\"Mexico City\",\" M...
## $ instant_bookable <lgl> FALSE, TRUE, FALSE, FA...
## $ is_business_travel_ready <lgl> FALSE, FALSE, FALSE, F...
## $ cancellation_policy <chr> "flexible", "moderate"...
## $ require_guest_profile_picture <lgl> FALSE, FALSE, FALSE, F...
## $ require_guest_phone_verification <lgl> FALSE, FALSE, FALSE, F...
## $ calculated_host_listings_count <dbl> 1, 2, 2, 3, 2, 3, 4, 2...
## $ calculated_host_listings_count_entire_homes <dbl> 1, 2, 0, 2, 2, 1, 2, 0...
## $ calculated_host_listings_count_private_rooms <dbl> 0, 0, 2, 1, 0, 2, 2, 2...
## $ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0...
## $ reviews_per_month <dbl> NA, 1.89, 1.62, 1.00, ...
We have 21824 rows of data and 106 columns, i.e. variables. We can see the following data types: - Strings (
skim(listings)
| Name | listings |
| Number of rows | 21824 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 47 |
| Date | 5 |
| logical | 15 |
| numeric | 39 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 21824 | 0 |
| name | 8 | 1.00 | 1 | 255 | 0 | 21094 | 0 |
| summary | 1443 | 0.93 | 1 | 1000 | 0 | 18458 | 0 |
| space | 6009 | 0.72 | 1 | 1000 | 0 | 14005 | 0 |
| description | 1141 | 0.95 | 1 | 1000 | 0 | 19519 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 6312 | 0.71 | 1 | 1000 | 0 | 12861 | 0 |
| notes | 13372 | 0.39 | 1 | 1000 | 0 | 7097 | 0 |
| transit | 7254 | 0.67 | 1 | 1000 | 0 | 12094 | 0 |
| access | 10332 | 0.53 | 1 | 1000 | 0 | 9843 | 0 |
| interaction | 7743 | 0.65 | 1 | 1000 | 0 | 11412 | 0 |
| house_rules | 9438 | 0.57 | 1 | 1000 | 0 | 10357 | 0 |
| picture_url | 0 | 1.00 | 35 | 146 | 0 | 21216 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 13139 | 0 |
| host_name | 0 | 1.00 | 1 | 35 | 0 | 4149 | 0 |
| host_location | 75 | 1.00 | 1 | 104 | 0 | 625 | 0 |
| host_about | 8718 | 0.60 | 1 | 5443 | 0 | 7251 | 8 |
| host_response_time | 0 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 0 | 1.00 | 2 | 4 | 0 | 52 | 0 |
| host_acceptance_rate | 0 | 1.00 | 2 | 4 | 0 | 81 | 0 |
| host_thumbnail_url | 0 | 1.00 | 55 | 106 | 0 | 13103 | 0 |
| host_picture_url | 0 | 1.00 | 57 | 109 | 0 | 13103 | 0 |
| host_neighbourhood | 9299 | 0.57 | 3 | 39 | 0 | 171 | 0 |
| host_verifications | 0 | 1.00 | 2 | 161 | 0 | 304 | 0 |
| street | 0 | 1.00 | 10 | 166 | 0 | 528 | 0 |
| neighbourhood | 4894 | 0.78 | 4 | 33 | 0 | 54 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 7 | 22 | 0 | 16 | 0 |
| city | 30 | 1.00 | 2 | 146 | 0 | 270 | 0 |
| state | 150 | 0.99 | 2 | 38 | 0 | 120 | 0 |
| zipcode | 1172 | 0.95 | 4 | 31 | 0 | 865 | 0 |
| market | 25 | 1.00 | 8 | 21 | 0 | 5 | 0 |
| smart_location | 0 | 1.00 | 6 | 154 | 0 | 291 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 23 | 0 | 37 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 4 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1714 | 0 | 20546 | 0 |
| price | 0 | 1.00 | 5 | 11 | 0 | 1042 | 0 |
| weekly_price | 20852 | 0.04 | 6 | 11 | 0 | 573 | 0 |
| monthly_price | 20873 | 0.04 | 7 | 13 | 0 | 603 | 0 |
| security_deposit | 9696 | 0.56 | 5 | 11 | 0 | 671 | 0 |
| cleaning_fee | 7439 | 0.66 | 5 | 10 | 0 | 524 | 0 |
| extra_people | 0 | 1.00 | 5 | 9 | 0 | 392 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 14 | 0 | 88 | 0 |
| jurisdiction_names | 331 | 0.98 | 18 | 110 | 0 | 17 | 0 |
| cancellation_policy | 1 | 1.00 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| host_since | 0 | 1.00 | 2009-02-03 | 2020-06-16 | 2016-11-01 | 2947 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| first_review | 5371 | 0.75 | 2011-07-28 | 2020-06-22 | 2018-12-04 | 2113 |
| last_review | 5371 | 0.75 | 2013-12-21 | 2020-06-25 | 2020-03-03 | 1287 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 21824 | 0 | NaN | : |
| medium_url | 21824 | 0 | NaN | : |
| xl_picture_url | 21824 | 0 | NaN | : |
| host_is_superhost | 0 | 1 | 0.34 | FAL: 14359, TRU: 7465 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 21784, FAL: 40 |
| host_identity_verified | 0 | 1 | 0.30 | FAL: 15266, TRU: 6558 |
| neighbourhood_group_cleansed | 21824 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.83 | TRU: 18037, FAL: 3787 |
| has_availability | 0 | 1 | 1.00 | TRU: 21824 |
| requires_license | 0 | 1 | 0.00 | FAL: 21824 |
| license | 21824 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.57 | TRU: 12471, FAL: 9353 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 21824 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 21665, TRU: 159 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 21666, TRU: 158 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.84e+07 | 1.13e+07 | 3.58e+04 | 2.00e+07 | 3.02e+07 | 3.86e+07 | 4.39e+07 | ▂▃▅▅▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.22e+08 | 9.68e+07 | 7.36e+03 | 3.75e+07 | 1.02e+08 | 1.94e+08 | 3.50e+08 | ▇▅▃▃▂ |
| host_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 1.94e+01 | 5.00e-02 | 1.92e+01 | 1.94e+01 | 1.94e+01 | 1.94e+01 | 1.96e+01 | ▁▁▅▇▁ |
| longitude | 0 | 1.00 | -9.92e+01 | 4.00e-02 | -9.93e+01 | -9.92e+01 | -9.92e+01 | -9.92e+01 | -9.90e+01 | ▁▁▇▁▁ |
| accommodates | 0 | 1.00 | 3.05e+00 | 2.21e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bathrooms | 26 | 1.00 | 1.40e+00 | 1.00e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.50e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bedrooms | 48 | 1.00 | 1.42e+00 | 1.14e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| beds | 274 | 0.99 | 1.83e+00 | 1.75e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| square_feet | 21758 | 0.00 | 1.07e+03 | 3.99e+03 | 0.00e+00 | 0.00e+00 | 2.37e+02 | 8.40e+02 | 3.23e+04 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.62e+00 | 1.34e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.60e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 4.09e+00 | 2.24e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.86e+02 | 7.08e+02 | 1.00e+00 | 4.50e+01 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 3.89e+00 | 1.96e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 4.20e+00 | 2.26e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 8.47e+02 | 6.70e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 8.49e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 4.04e+00 | 2.06e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 8.48e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.95e+01 | 1.24e+01 | 0.00e+00 | 1.00e+00 | 2.60e+01 | 3.00e+01 | 3.00e+01 | ▃▁▁▂▇ |
| availability_60 | 0 | 1.00 | 4.21e+01 | 2.38e+01 | 0.00e+00 | 2.20e+01 | 5.60e+01 | 6.00e+01 | 6.00e+01 | ▂▁▁▁▇ |
| availability_90 | 0 | 1.00 | 6.54e+01 | 3.47e+01 | 0.00e+00 | 4.90e+01 | 8.50e+01 | 9.00e+01 | 9.00e+01 | ▂▁▁▁▇ |
| availability_365 | 0 | 1.00 | 2.20e+02 | 1.39e+02 | 0.00e+00 | 8.90e+01 | 2.11e+02 | 3.61e+02 | 3.65e+02 | ▃▂▃▁▇ |
| number_of_reviews | 0 | 1.00 | 2.36e+01 | 4.18e+01 | 0.00e+00 | 1.00e+00 | 6.00e+00 | 2.80e+01 | 5.55e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 9.18e+00 | 1.50e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.20e+01 | 1.77e+02 | ▇▁▁▁▁ |
| review_scores_rating | 5616 | 0.74 | 9.51e+01 | 8.64e+00 | 2.00e+01 | 9.40e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| review_scores_accuracy | 5632 | 0.74 | 9.72e+00 | 8.50e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 5632 | 0.74 | 9.59e+00 | 9.20e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_checkin | 5635 | 0.74 | 9.82e+00 | 7.20e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 5631 | 0.74 | 9.78e+00 | 7.90e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 5636 | 0.74 | 9.81e+00 | 6.80e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 5637 | 0.74 | 9.58e+00 | 9.00e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 6.09e+00 | 1.60e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.09e+00 | 1.55e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.77e+00 | 4.20e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 4.80e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 9.00e-02 | 8.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| reviews_per_month | 5371 | 0.75 | 1.36e+00 | 1.50e+00 | 1.00e-02 | 3.00e-01 | 8.20e-01 | 1.92e+00 | 1.47e+01 | ▇▁▁▁▁ |
The table above provides summary statistics for variables in the listings data frame and helps to improve our understanding of it.
The table above has already grouped variables by data type. At first there are 39 numeric variables. However, we also noticed that some variables, such as price, are currently in character form, albeit being potentially more useful in numeric form. Let’s change them to number variables.
listings2 <- listings %>%
# Convert prices from characters to numeric format ( remove '$' and ',' then convert to number)
mutate(price = as.numeric(gsub('[$,]', '', price)),
weekly_price = as.numeric(gsub('[$,]', '', weekly_price)),
monthly_price = as.numeric(gsub('[$,]', '', monthly_price)),
security_deposit = as.numeric(gsub('[$,]', '', security_deposit)),
cleaning_fee = as.numeric(gsub('[$,]', '', cleaning_fee)),
extra_people = as.numeric(gsub('[$,]', '', extra_people))) %>%
# Convert percentages from characters to numeric format ( remove '%' then convert to number)
mutate(host_response_rate = as.numeric(gsub('[%]', '', host_response_rate)) / 100,
host_acceptance_rate = as.numeric(gsub('[%]', '', host_acceptance_rate)) / 100)
Now we check to ensure that they are numeric variables, and then display the new list of numeric variables.
# Use typeof to check if previously reformatted variables are now numeric variables
typeof(listings2$price)
## [1] "double"
typeof(listings2$weekly_price)
## [1] "double"
typeof(listings2$security_deposit)
## [1] "double"
typeof(listings2$cleaning_fee)
## [1] "double"
typeof(listings2$extra_people)
## [1] "double"
typeof(listings2$host_response_rate)
## [1] "double"
typeof(listings2$host_acceptance_rate)
## [1] "double"
listings2 %>%
# Show only columns with numerical variables
select_if(is.numeric) %>%
# Use pivot.longer() to list numerical variables in a vertical list, values_to is just arbitrary as this value column will be deselected in the next step
pivot_longer(c(1:47), names_to = "Numerical_variables", values_to = "value") %>%
# Show column with variables name only
select(-value) %>%
# Take the top 47 rows only, as the rest are just repeated names
head(47)
| Numerical_variables |
|---|
| id |
| scrape_id |
| host_id |
| host_response_rate |
| host_acceptance_rate |
| host_listings_count |
| host_total_listings_count |
| latitude |
| longitude |
| accommodates |
| bathrooms |
| bedrooms |
| beds |
| square_feet |
| price |
| weekly_price |
| monthly_price |
| security_deposit |
| cleaning_fee |
| guests_included |
| extra_people |
| minimum_nights |
| maximum_nights |
| minimum_minimum_nights |
| maximum_minimum_nights |
| minimum_maximum_nights |
| maximum_maximum_nights |
| minimum_nights_avg_ntm |
| maximum_nights_avg_ntm |
| availability_30 |
| availability_60 |
| availability_90 |
| availability_365 |
| number_of_reviews |
| number_of_reviews_ltm |
| review_scores_rating |
| review_scores_accuracy |
| review_scores_cleanliness |
| review_scores_checkin |
| review_scores_communication |
| review_scores_location |
| review_scores_value |
| calculated_host_listings_count |
| calculated_host_listings_count_entire_homes |
| calculated_host_listings_count_private_rooms |
| calculated_host_listings_count_shared_rooms |
| reviews_per_month |
We use the skim function again to see the number of unique variables per column. This helps us identify which category might contain factor variables. We can then use the distinct function to display the distinct components of these categories.
skim(listings2)
| Name | listings2 |
| Number of rows | 21824 |
| Number of columns | 106 |
| _______________________ | |
| Column type frequency: | |
| character | 39 |
| Date | 5 |
| logical | 15 |
| numeric | 47 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 21824 | 0 |
| name | 8 | 1.00 | 1 | 255 | 0 | 21094 | 0 |
| summary | 1443 | 0.93 | 1 | 1000 | 0 | 18458 | 0 |
| space | 6009 | 0.72 | 1 | 1000 | 0 | 14005 | 0 |
| description | 1141 | 0.95 | 1 | 1000 | 0 | 19519 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 6312 | 0.71 | 1 | 1000 | 0 | 12861 | 0 |
| notes | 13372 | 0.39 | 1 | 1000 | 0 | 7097 | 0 |
| transit | 7254 | 0.67 | 1 | 1000 | 0 | 12094 | 0 |
| access | 10332 | 0.53 | 1 | 1000 | 0 | 9843 | 0 |
| interaction | 7743 | 0.65 | 1 | 1000 | 0 | 11412 | 0 |
| house_rules | 9438 | 0.57 | 1 | 1000 | 0 | 10357 | 0 |
| picture_url | 0 | 1.00 | 35 | 146 | 0 | 21216 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 13139 | 0 |
| host_name | 0 | 1.00 | 1 | 35 | 0 | 4149 | 0 |
| host_location | 75 | 1.00 | 1 | 104 | 0 | 625 | 0 |
| host_about | 8718 | 0.60 | 1 | 5443 | 0 | 7251 | 8 |
| host_response_time | 0 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_thumbnail_url | 0 | 1.00 | 55 | 106 | 0 | 13103 | 0 |
| host_picture_url | 0 | 1.00 | 57 | 109 | 0 | 13103 | 0 |
| host_neighbourhood | 9299 | 0.57 | 3 | 39 | 0 | 171 | 0 |
| host_verifications | 0 | 1.00 | 2 | 161 | 0 | 304 | 0 |
| street | 0 | 1.00 | 10 | 166 | 0 | 528 | 0 |
| neighbourhood | 4894 | 0.78 | 4 | 33 | 0 | 54 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 7 | 22 | 0 | 16 | 0 |
| city | 30 | 1.00 | 2 | 146 | 0 | 270 | 0 |
| state | 150 | 0.99 | 2 | 38 | 0 | 120 | 0 |
| zipcode | 1172 | 0.95 | 4 | 31 | 0 | 865 | 0 |
| market | 25 | 1.00 | 8 | 21 | 0 | 5 | 0 |
| smart_location | 0 | 1.00 | 6 | 154 | 0 | 291 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 23 | 0 | 37 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 4 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1714 | 0 | 20546 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 14 | 0 | 88 | 0 |
| jurisdiction_names | 331 | 0.98 | 18 | 110 | 0 | 17 | 0 |
| cancellation_policy | 1 | 1.00 | 8 | 27 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| host_since | 0 | 1.00 | 2009-02-03 | 2020-06-16 | 2016-11-01 | 2947 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| first_review | 5371 | 0.75 | 2011-07-28 | 2020-06-22 | 2018-12-04 | 2113 |
| last_review | 5371 | 0.75 | 2013-12-21 | 2020-06-25 | 2020-03-03 | 1287 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 21824 | 0 | NaN | : |
| medium_url | 21824 | 0 | NaN | : |
| xl_picture_url | 21824 | 0 | NaN | : |
| host_is_superhost | 0 | 1 | 0.34 | FAL: 14359, TRU: 7465 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 21784, FAL: 40 |
| host_identity_verified | 0 | 1 | 0.30 | FAL: 15266, TRU: 6558 |
| neighbourhood_group_cleansed | 21824 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.83 | TRU: 18037, FAL: 3787 |
| has_availability | 0 | 1 | 1.00 | TRU: 21824 |
| requires_license | 0 | 1 | 0.00 | FAL: 21824 |
| license | 21824 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.57 | TRU: 12471, FAL: 9353 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 21824 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 21665, TRU: 159 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 21666, TRU: 158 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.84e+07 | 1.13e+07 | 3.58e+04 | 2.00e+07 | 3.02e+07 | 3.86e+07 | 4.39e+07 | ▂▃▅▅▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.22e+08 | 9.68e+07 | 7.36e+03 | 3.75e+07 | 1.02e+08 | 1.94e+08 | 3.50e+08 | ▇▅▃▃▂ |
| host_response_rate | 8450 | 0.61 | 9.10e-01 | 2.30e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | ▁▁▁▁▇ |
| host_acceptance_rate | 3868 | 0.82 | 8.90e-01 | 2.30e-01 | 0.00e+00 | 9.00e-01 | 9.90e-01 | 1.00e+00 | 1.00e+00 | ▁▁▁▁▇ |
| host_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 1.94e+01 | 5.00e-02 | 1.92e+01 | 1.94e+01 | 1.94e+01 | 1.94e+01 | 1.96e+01 | ▁▁▅▇▁ |
| longitude | 0 | 1.00 | -9.92e+01 | 4.00e-02 | -9.93e+01 | -9.92e+01 | -9.92e+01 | -9.92e+01 | -9.90e+01 | ▁▁▇▁▁ |
| accommodates | 0 | 1.00 | 3.05e+00 | 2.21e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bathrooms | 26 | 1.00 | 1.40e+00 | 1.00e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.50e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bedrooms | 48 | 1.00 | 1.42e+00 | 1.14e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| beds | 274 | 0.99 | 1.83e+00 | 1.75e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| square_feet | 21758 | 0.00 | 1.07e+03 | 3.99e+03 | 0.00e+00 | 0.00e+00 | 2.37e+02 | 8.40e+02 | 3.23e+04 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 1.48e+03 | 4.99e+03 | 0.00e+00 | 4.08e+02 | 7.25e+02 | 1.32e+03 | 3.50e+05 | ▇▁▁▁▁ |
| weekly_price | 20852 | 0.04 | 1.26e+04 | 2.88e+04 | 6.00e+01 | 2.63e+03 | 5.21e+03 | 1.01e+04 | 3.42e+05 | ▇▁▁▁▁ |
| monthly_price | 20873 | 0.04 | 4.86e+04 | 1.16e+05 | 2.00e+02 | 9.00e+03 | 1.74e+04 | 3.50e+04 | 1.51e+06 | ▇▁▁▁▁ |
| security_deposit | 9696 | 0.56 | 3.28e+03 | 8.94e+03 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 3.40e+03 | 1.15e+05 | ▇▁▁▁▁ |
| cleaning_fee | 7439 | 0.66 | 3.61e+02 | 5.23e+02 | 0.00e+00 | 1.14e+02 | 3.00e+02 | 4.53e+02 | 2.40e+04 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.62e+00 | 1.34e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 1.23e+02 | 2.69e+02 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+02 | 6.84e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 4.09e+00 | 2.24e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.86e+02 | 7.08e+02 | 1.00e+00 | 4.50e+01 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 3.89e+00 | 1.96e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 4.20e+00 | 2.26e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 8.47e+02 | 6.70e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 8.49e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 4.04e+00 | 2.06e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 8.48e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.95e+01 | 1.24e+01 | 0.00e+00 | 1.00e+00 | 2.60e+01 | 3.00e+01 | 3.00e+01 | ▃▁▁▂▇ |
| availability_60 | 0 | 1.00 | 4.21e+01 | 2.38e+01 | 0.00e+00 | 2.20e+01 | 5.60e+01 | 6.00e+01 | 6.00e+01 | ▂▁▁▁▇ |
| availability_90 | 0 | 1.00 | 6.54e+01 | 3.47e+01 | 0.00e+00 | 4.90e+01 | 8.50e+01 | 9.00e+01 | 9.00e+01 | ▂▁▁▁▇ |
| availability_365 | 0 | 1.00 | 2.20e+02 | 1.39e+02 | 0.00e+00 | 8.90e+01 | 2.11e+02 | 3.61e+02 | 3.65e+02 | ▃▂▃▁▇ |
| number_of_reviews | 0 | 1.00 | 2.36e+01 | 4.18e+01 | 0.00e+00 | 1.00e+00 | 6.00e+00 | 2.80e+01 | 5.55e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 9.18e+00 | 1.50e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.20e+01 | 1.77e+02 | ▇▁▁▁▁ |
| review_scores_rating | 5616 | 0.74 | 9.51e+01 | 8.64e+00 | 2.00e+01 | 9.40e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| review_scores_accuracy | 5632 | 0.74 | 9.72e+00 | 8.50e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 5632 | 0.74 | 9.59e+00 | 9.20e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_checkin | 5635 | 0.74 | 9.82e+00 | 7.20e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 5631 | 0.74 | 9.78e+00 | 7.90e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 5636 | 0.74 | 9.81e+00 | 6.80e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 5637 | 0.74 | 9.58e+00 | 9.00e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 6.09e+00 | 1.60e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.09e+00 | 1.55e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.77e+00 | 4.20e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 4.80e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 9.00e-02 | 8.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| reviews_per_month | 5371 | 0.75 | 1.36e+00 | 1.50e+00 | 1.00e-02 | 3.00e-01 | 8.20e-01 | 1.92e+00 | 1.47e+01 | ▇▁▁▁▁ |
# Use distinct to see distinct components of each variable
listings %>% distinct(host_response_time)
| host_response_time |
|---|
| N/A |
| within an hour |
| within a day |
| within a few hours |
| a few days or more |
listings %>% distinct(market)
| market |
|---|
| Mexico City |
| Other (International) |
| Pochutla |
| Santander |
| Playa del Carmen |
listings %>% distinct(room_type)
| room_type |
|---|
| Entire home/apt |
| Private room |
| Hotel room |
| Shared room |
listings %>% distinct(bed_type)
| bed_type |
|---|
| Futon |
| Real Bed |
| Couch |
| Pull-out Sofa |
| Airbed |
listings %>% distinct(cancellation_policy)
| cancellation_policy |
|---|
| flexible |
| moderate |
| strict_14_with_grace_period |
| super_strict_30 |
| super_strict_60 |
listings %>% distinct(experiences_offered)
| experiences_offered |
|---|
| none |
listings %>% distinct(neighbourhood_cleansed)
| neighbourhood_cleansed |
|---|
| Cuajimalpa de Morelos |
| Cuauhtémoc |
| Coyoacán |
| Miguel Hidalgo |
| Benito Juárez |
| Azcapotzalco |
| Iztacalco |
| Tlalpan |
| La Magdalena Contreras |
| Venustiano Carranza |
| Álvaro Obregón |
| Gustavo A. Madero |
| Iztapalapa |
| Xochimilco |
| Tláhuac |
| Milpa Alta |
We find the following:
listings3 <- listings2 %>%
# Reorder factor variables where it is appropriate to do so
mutate(host_response_time = factor(host_response_time, c("N/A","within an hour", "within a few hours", "within a day", "a few days or more"),
levels = c("within an hour", "within a few hours", "within a day", "a few days or more", "N/A")),
cancellation_policy = factor(cancellation_policy, c("flexible", "moderate", "strict_14_with_grace_period", "super_strict_30", "super_strict_60"),
levels = c("flexible", "moderate", "strict_14_with_grace_period", "super_strict_30", "super_strict_60")),
# Save 4 most popular property types as individual categories, and the rest as "Other"
prop_type_simplified = case_when(property_type %in% c("Apartment","House", "Condominium","Loft") ~ property_type,
TRUE ~ "Other"),
# Create boolean for whether a real bed is included or not
real_bed = ifelse(bed_type == "Real Bed", TRUE, FALSE))
We have only formatted some character variables as factors. These are what we expect to be the most useful variables in factor form. The rest can formatted as factors later if needed.
We use the skim function again, this time to study the variables cleaning_fee and security_deposit:
skim(listings3$cleaning_fee)
| Name | listings3$cleaning_fee |
| Number of rows | 21824 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 7439 | 0.66 | 361 | 523 | 0 | 114 | 300 | 453 | 24000 | ▇▁▁▁▁ |
skim(listings3$security_deposit)
| Name | listings3$security_deposi… |
| Number of rows | 21824 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| data | 9696 | 0.56 | 3278 | 8935 | 0 | 0 | 0 | 3397 | 115498 | ▇▁▁▁▁ |
We can see that:
Cleaning fee has multiple missing values. We should not exclude these as it means there is no cleaning fee, i.e. it is 0.
Similarly, security deposit also has multiple missing values. We should not exclude these as it means we have no deposit requirements, i.e. it is 0
For this reason, we decided to condition them to set a value of zero for all missing entries. .
listings4 <- listings3 %>%
# Set cleaning fees and security deposit to 0 if entry is NA
mutate(cleaning_fee = case_when(is.na(cleaning_fee) ~ 0,
TRUE ~ cleaning_fee),
security_deposit = case_when(is.na(security_deposit) ~ 0,
TRUE ~ security_deposit))
We know skim the entire data frame again to further understand other missing values
skim(listings4)
| Name | listings4 |
| Number of rows | 21824 |
| Number of columns | 108 |
| _______________________ | |
| Column type frequency: | |
| character | 38 |
| Date | 5 |
| factor | 2 |
| logical | 16 |
| numeric | 47 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 34 | 37 | 0 | 21824 | 0 |
| name | 8 | 1.00 | 1 | 255 | 0 | 21094 | 0 |
| summary | 1443 | 0.93 | 1 | 1000 | 0 | 18458 | 0 |
| space | 6009 | 0.72 | 1 | 1000 | 0 | 14005 | 0 |
| description | 1141 | 0.95 | 1 | 1000 | 0 | 19519 | 0 |
| experiences_offered | 0 | 1.00 | 4 | 4 | 0 | 1 | 0 |
| neighborhood_overview | 6312 | 0.71 | 1 | 1000 | 0 | 12861 | 0 |
| notes | 13372 | 0.39 | 1 | 1000 | 0 | 7097 | 0 |
| transit | 7254 | 0.67 | 1 | 1000 | 0 | 12094 | 0 |
| access | 10332 | 0.53 | 1 | 1000 | 0 | 9843 | 0 |
| interaction | 7743 | 0.65 | 1 | 1000 | 0 | 11412 | 0 |
| house_rules | 9438 | 0.57 | 1 | 1000 | 0 | 10357 | 0 |
| picture_url | 0 | 1.00 | 35 | 146 | 0 | 21216 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 13139 | 0 |
| host_name | 0 | 1.00 | 1 | 35 | 0 | 4149 | 0 |
| host_location | 75 | 1.00 | 1 | 104 | 0 | 625 | 0 |
| host_about | 8718 | 0.60 | 1 | 5443 | 0 | 7251 | 8 |
| host_thumbnail_url | 0 | 1.00 | 55 | 106 | 0 | 13103 | 0 |
| host_picture_url | 0 | 1.00 | 57 | 109 | 0 | 13103 | 0 |
| host_neighbourhood | 9299 | 0.57 | 3 | 39 | 0 | 171 | 0 |
| host_verifications | 0 | 1.00 | 2 | 161 | 0 | 304 | 0 |
| street | 0 | 1.00 | 10 | 166 | 0 | 528 | 0 |
| neighbourhood | 4894 | 0.78 | 4 | 33 | 0 | 54 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 7 | 22 | 0 | 16 | 0 |
| city | 30 | 1.00 | 2 | 146 | 0 | 270 | 0 |
| state | 150 | 0.99 | 2 | 38 | 0 | 120 | 0 |
| zipcode | 1172 | 0.95 | 4 | 31 | 0 | 865 | 0 |
| market | 25 | 1.00 | 8 | 21 | 0 | 5 | 0 |
| smart_location | 0 | 1.00 | 6 | 154 | 0 | 291 | 0 |
| country_code | 0 | 1.00 | 2 | 2 | 0 | 1 | 0 |
| country | 0 | 1.00 | 6 | 6 | 0 | 1 | 0 |
| property_type | 0 | 1.00 | 3 | 23 | 0 | 37 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bed_type | 4 | 1.00 | 5 | 13 | 0 | 5 | 0 |
| amenities | 0 | 1.00 | 2 | 1714 | 0 | 20546 | 0 |
| calendar_updated | 0 | 1.00 | 5 | 14 | 0 | 88 | 0 |
| jurisdiction_names | 331 | 0.98 | 18 | 110 | 0 | 17 | 0 |
| prop_type_simplified | 0 | 1.00 | 4 | 11 | 0 | 5 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| host_since | 0 | 1.00 | 2009-02-03 | 2020-06-16 | 2016-11-01 | 2947 |
| calendar_last_scraped | 0 | 1.00 | 2020-06-20 | 2020-06-26 | 2020-06-21 | 7 |
| first_review | 5371 | 0.75 | 2011-07-28 | 2020-06-22 | 2018-12-04 | 2113 |
| last_review | 5371 | 0.75 | 2013-12-21 | 2020-06-25 | 2020-03-03 | 1287 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| host_response_time | 0 | 1 | FALSE | 5 | a f: 8450, N/A: 7865, wit: 2876, wit: 1858 |
| cancellation_policy | 1 | 1 | FALSE | 5 | fle: 11310, mod: 6015, str: 4467, sup: 28 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| thumbnail_url | 21824 | 0 | NaN | : |
| medium_url | 21824 | 0 | NaN | : |
| xl_picture_url | 21824 | 0 | NaN | : |
| host_is_superhost | 0 | 1 | 0.34 | FAL: 14359, TRU: 7465 |
| host_has_profile_pic | 0 | 1 | 1.00 | TRU: 21784, FAL: 40 |
| host_identity_verified | 0 | 1 | 0.30 | FAL: 15266, TRU: 6558 |
| neighbourhood_group_cleansed | 21824 | 0 | NaN | : |
| is_location_exact | 0 | 1 | 0.83 | TRU: 18037, FAL: 3787 |
| has_availability | 0 | 1 | 1.00 | TRU: 21824 |
| requires_license | 0 | 1 | 0.00 | FAL: 21824 |
| license | 21824 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.57 | TRU: 12471, FAL: 9353 |
| is_business_travel_ready | 0 | 1 | 0.00 | FAL: 21824 |
| require_guest_profile_picture | 0 | 1 | 0.01 | FAL: 21665, TRU: 159 |
| require_guest_phone_verification | 0 | 1 | 0.01 | FAL: 21666, TRU: 158 |
| real_bed | 4 | 1 | 0.99 | TRU: 21664, FAL: 156 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.84e+07 | 1.13e+07 | 3.58e+04 | 2.00e+07 | 3.02e+07 | 3.86e+07 | 4.39e+07 | ▂▃▅▅▇ |
| scrape_id | 0 | 1.00 | 2.02e+13 | 0.00e+00 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | 2.02e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.22e+08 | 9.68e+07 | 7.36e+03 | 3.75e+07 | 1.02e+08 | 1.94e+08 | 3.50e+08 | ▇▅▃▃▂ |
| host_response_rate | 8450 | 0.61 | 9.10e-01 | 2.30e-01 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | ▁▁▁▁▇ |
| host_acceptance_rate | 3868 | 0.82 | 8.90e-01 | 2.30e-01 | 0.00e+00 | 9.00e-01 | 9.90e-01 | 1.00e+00 | 1.00e+00 | ▁▁▁▁▇ |
| host_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 0 | 1.00 | 3.04e+01 | 2.82e+02 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+00 | 3.33e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 1.94e+01 | 5.00e-02 | 1.92e+01 | 1.94e+01 | 1.94e+01 | 1.94e+01 | 1.96e+01 | ▁▁▅▇▁ |
| longitude | 0 | 1.00 | -9.92e+01 | 4.00e-02 | -9.93e+01 | -9.92e+01 | -9.92e+01 | -9.92e+01 | -9.90e+01 | ▁▁▇▁▁ |
| accommodates | 0 | 1.00 | 3.05e+00 | 2.21e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 4.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bathrooms | 26 | 1.00 | 1.40e+00 | 1.00e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 1.50e+00 | 5.00e+01 | ▇▁▁▁▁ |
| bedrooms | 48 | 1.00 | 1.42e+00 | 1.14e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| beds | 274 | 0.99 | 1.83e+00 | 1.75e+00 | 0.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 5.00e+01 | ▇▁▁▁▁ |
| square_feet | 21758 | 0.00 | 1.07e+03 | 3.99e+03 | 0.00e+00 | 0.00e+00 | 2.37e+02 | 8.40e+02 | 3.23e+04 | ▇▁▁▁▁ |
| price | 0 | 1.00 | 1.48e+03 | 4.99e+03 | 0.00e+00 | 4.08e+02 | 7.25e+02 | 1.32e+03 | 3.50e+05 | ▇▁▁▁▁ |
| weekly_price | 20852 | 0.04 | 1.26e+04 | 2.88e+04 | 6.00e+01 | 2.63e+03 | 5.21e+03 | 1.01e+04 | 3.42e+05 | ▇▁▁▁▁ |
| monthly_price | 20873 | 0.04 | 4.86e+04 | 1.16e+05 | 2.00e+02 | 9.00e+03 | 1.74e+04 | 3.50e+04 | 1.51e+06 | ▇▁▁▁▁ |
| security_deposit | 0 | 1.00 | 1.82e+03 | 6.86e+03 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+03 | 1.15e+05 | ▇▁▁▁▁ |
| cleaning_fee | 0 | 1.00 | 2.38e+02 | 4.58e+02 | 0.00e+00 | 0.00e+00 | 1.12e+02 | 3.50e+02 | 2.40e+04 | ▇▁▁▁▁ |
| guests_included | 0 | 1.00 | 1.62e+00 | 1.34e+00 | 1.00e+00 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.60e+01 | ▇▁▁▁▁ |
| extra_people | 0 | 1.00 | 1.23e+02 | 2.69e+02 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+02 | 6.84e+03 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 4.09e+00 | 2.24e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 6.86e+02 | 7.08e+02 | 1.00e+00 | 4.50e+01 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_minimum_nights | 0 | 1.00 | 3.89e+00 | 1.96e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 0 | 1.00 | 4.20e+00 | 2.26e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 3.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 0 | 1.00 | 8.47e+02 | 6.70e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| maximum_maximum_nights | 0 | 1.00 | 8.49e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 0 | 1.00 | 4.04e+00 | 2.06e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 2.00e+00 | 1.12e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 0 | 1.00 | 8.48e+02 | 6.69e+02 | 1.00e+00 | 3.65e+02 | 1.12e+03 | 1.12e+03 | 5.00e+04 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 1.95e+01 | 1.24e+01 | 0.00e+00 | 1.00e+00 | 2.60e+01 | 3.00e+01 | 3.00e+01 | ▃▁▁▂▇ |
| availability_60 | 0 | 1.00 | 4.21e+01 | 2.38e+01 | 0.00e+00 | 2.20e+01 | 5.60e+01 | 6.00e+01 | 6.00e+01 | ▂▁▁▁▇ |
| availability_90 | 0 | 1.00 | 6.54e+01 | 3.47e+01 | 0.00e+00 | 4.90e+01 | 8.50e+01 | 9.00e+01 | 9.00e+01 | ▂▁▁▁▇ |
| availability_365 | 0 | 1.00 | 2.20e+02 | 1.39e+02 | 0.00e+00 | 8.90e+01 | 2.11e+02 | 3.61e+02 | 3.65e+02 | ▃▂▃▁▇ |
| number_of_reviews | 0 | 1.00 | 2.36e+01 | 4.18e+01 | 0.00e+00 | 1.00e+00 | 6.00e+00 | 2.80e+01 | 5.55e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 9.18e+00 | 1.50e+01 | 0.00e+00 | 0.00e+00 | 2.00e+00 | 1.20e+01 | 1.77e+02 | ▇▁▁▁▁ |
| review_scores_rating | 5616 | 0.74 | 9.51e+01 | 8.64e+00 | 2.00e+01 | 9.40e+01 | 9.70e+01 | 1.00e+02 | 1.00e+02 | ▁▁▁▁▇ |
| review_scores_accuracy | 5632 | 0.74 | 9.72e+00 | 8.50e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_cleanliness | 5632 | 0.74 | 9.59e+00 | 9.20e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_checkin | 5635 | 0.74 | 9.82e+00 | 7.20e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_communication | 5631 | 0.74 | 9.78e+00 | 7.90e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_location | 5636 | 0.74 | 9.81e+00 | 6.80e-01 | 2.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| review_scores_value | 5637 | 0.74 | 9.58e+00 | 9.00e-01 | 2.00e+00 | 9.00e+00 | 1.00e+01 | 1.00e+01 | 1.00e+01 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 6.09e+00 | 1.60e+01 | 1.00e+00 | 1.00e+00 | 2.00e+00 | 4.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 4.09e+00 | 1.55e+01 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 1.57e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.77e+00 | 4.20e+00 | 0.00e+00 | 0.00e+00 | 1.00e+00 | 2.00e+00 | 4.80e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 9.00e-02 | 8.30e-01 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 0.00e+00 | 2.00e+01 | ▇▁▁▁▁ |
| reviews_per_month | 5371 | 0.75 | 1.36e+00 | 1.50e+00 | 1.00e-02 | 3.00e-01 | 8.20e-01 | 1.92e+00 | 1.47e+01 | ▇▁▁▁▁ |
We found that:
What are the most common values for the variable minimum_nights?
count_minimum_nights <- listings4 %>%
# Count number of properties for each level of minimum nights
group_by(minimum_nights) %>%
summarise(count = n()) %>%
# Arrange in descending order
arrange(desc(count))
# Show list
count_minimum_nights
| minimum_nights | count |
|---|---|
| 1 | 10088 |
| 2 | 6474 |
| 3 | 2376 |
| 5 | 620 |
| 7 | 534 |
| 4 | 490 |
| 30 | 313 |
| 6 | 194 |
| 15 | 160 |
| 10 | 79 |
| 28 | 70 |
| 14 | 64 |
| 20 | 51 |
| 8 | 36 |
| 90 | 34 |
| 60 | 30 |
| 180 | 28 |
| 12 | 26 |
| 365 | 22 |
| 25 | 21 |
| 29 | 12 |
| 9 | 11 |
| 21 | 9 |
| 120 | 9 |
| 13 | 6 |
| 150 | 5 |
| 18 | 4 |
| 31 | 4 |
| 100 | 4 |
| 360 | 4 |
| 300 | 3 |
| 16 | 2 |
| 17 | 2 |
| 19 | 2 |
| 24 | 2 |
| 33 | 2 |
| 40 | 2 |
| 45 | 2 |
| 160 | 2 |
| 182 | 2 |
| 1e+03 | 2 |
| 11 | 1 |
| 22 | 1 |
| 26 | 1 |
| 35 | 1 |
| 39 | 1 |
| 54 | 1 |
| 55 | 1 |
| 61 | 1 |
| 63 | 1 |
| 80 | 1 |
| 89 | 1 |
| 92 | 1 |
| 93 | 1 |
| 99 | 1 |
| 132 | 1 |
| 179 | 1 |
| 183 | 1 |
| 350 | 1 |
| 363 | 1 |
| 400 | 1 |
| 500 | 1 |
| 1.12e+03 | 1 |
| 1.12e+03 | 1 |
179 days upwards: These properties most likely use Airbnb as if it was an alternative for Zoopla (as a marketing medium to advertise their properties to potential long-term tenants)
Is there any value among the common values that stands out?
30 days (this type of accommodation is not for short-term rents)
What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
30-180 days: for long-term tourists or long-term business visitors. 179 days upwards: These properties most likely use Airbnb as if it was an alternative for Zoopla (as an online marketing medium to advertise their properties to potential long-term tenants)
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes. When looking at the distribution of the minimum nights across listings we see that it is heavily left skewed. The majority of listings is as expected for short term renting, however there are some big outliers with over 100 days minimum nights stay. We will filter the data set to only include listings with a minimum night stay of <= 4. In addition, we will use our understanding of missing values in earlier parts to remove weekly prices, monthly prices, and square feet area.
# Filter properties with 4 minimum nights or less
listings5 <- listings4 %>%
filter(minimum_nights <= 4) %>%
## Deselect unwanted variables
select(-weekly_price, -monthly_price, -square_feet)
Our prior analyses have highlighted a list of variables that could be interesting for further analysis. In this section we will look at them in more detail
Histograms of host response and acceptance rates
# Investigate distribution of Host responses
listings5 %>%
ggplot() +
geom_histogram(aes(x= host_response_rate)) +
labs(title = "Most Airbnb hosts have 100% response rate",
subtitle = "Hosts with 0% Response Rate are potentially fake listings",
x = "Host Response Rate") +
theme_economist()
# Investigate distribution of Host Acceptance Rates
listings5 %>%
ggplot() +
geom_histogram(aes(x= host_acceptance_rate), binwidth = 0.05) +
labs(title = "Most Airbnb hosts have over 90% acceptance rate",
subtitle = "Hosts with 0% Acceptance Rate are potentially fake listings",
x = "Host Acceptance Rate") +
theme_economist()
Due to the competitiveness and strict hosting policies of Airbnb, most hosts tend to have very high acceptance and response rates.
listings5 %>%
# Filter out hosts who never responded or accepted any listings
filter(host_response_rate == 0,
host_acceptance_rate == 0) %>%
summarise(number_of_potentially_fake_listings = n())
| number_of_potentially_fake_listings |
|---|
| 40 |
# Count the number of hosts to be excluded
listings5 %>%
filter(host_acceptance_rate <= 0) %>%
summarise(to_be_removed = n())
| to_be_removed |
|---|
| 633 |
We can see that a substantial amount of hosts have never responded to requests and/or approved a request. We can be sure that hosts that don’t either respond to or accept any request are likely to be fake accounts. We can count 40 of these cases.
Extending our reasoning into acceptance rate and response rate individually, we believe hosts that don’t respond to requests but accept them might just be lazy, so we should not exclude them. However, hosts that don’t accept any request are potentially just using AirBnB to test the market and see whether there is any interest in the flat. Alternatively, they might have inflated price expectations that no one is willing to pay, or they are simply very new listings. Either way, we decided to exclude these 629 properties because they most likely are not good representations of the Airbnb market. This group already encompasses the aforementioned 40 listings that neither respond or accept requests, so we can just filter out any listings with 0 acceptance rate.
# Exclude all listings with hosts that do not respond or have never accepted
listings6 <- listings5 %>%
filter(host_acceptance_rate > 0)
# Calculate percentage of superhosts
listings6 %>%
mutate(superhosts = ifelse(host_is_superhost == TRUE,1,0)) %>%
summarise(hosts = n(),
superhosts = sum(superhosts)) %>%
summarise(perc_superhosts = superhosts/hosts)
| perc_superhosts |
|---|
| 0.433 |
We also found the percentage of superhosts to be at 43.8% - a healthy percentage of all hosts in Mexico City therefore regularly rents out their properties. Going forwards it will be interesting to see whether this has an effect on any other variables - especially price.
Outliers Included
With extreme outliers included, prices are heavily right skewed. Thus, boxplots of listings price are not visually comprehensible. Instead, boxplots of log prices will be plotted.
# Create simple density plot
listings6 %>%
ggplot() +
geom_histogram( aes(x = price), binwidth = 100) +
labs(title = "Price histogram heavily right-skewed",
subtitle = "Excessive price outliers should be investigated",
x = "Price per night") +
theme_economist() +
# Add dollar signs
scale_x_continuous(labels = scales::dollar_format())
# look for outliers - create boxplot on log of price data
listings6 %>%
mutate(log_price = log(price)) %>%
group_by(prop_type_simplified) %>%
ggplot() +
geom_boxplot(aes(x = reorder(prop_type_simplified, log_price, FUN = median), y = log_price)) +
theme(axis.title.x = element_blank()) +
labs(title = " Exponential price distribution shows that Houses are \n priced significantly lower than other property types",
subtitle = " Excessive price outliers should be excluded",
y = "Price - Log-Scale",
x = "Property type") +
theme_economist()
We saw a very skewed distribution of price. Outliers that charge for example over USD7,400 per night (or exp(8.91)) for a property do not seem to be legitimite properties and were therefore immediately excluded.
Extreme Outliers EXcluded
Filtering extreme outliers allow for a boxplot of non-log listing prices to be plotted. Here, we choose to filter out extreme outliers, which we defined as listings outside the 1.5 x IQR of log prices. The reason for using log-price outliers instead of price outliers is that this will exclude only the most unrealistically priced accommodations, and not those that are just quite fancy and luxurious.
# Find outliers that are to be excluded
exp(boxplot.stats(log(listings6$price))$out)
## [1] 9036 11275 10036 8830 33961 8438 10938 7924 10370 15849
## [11] 8232 9021 18113 20509 24821 10188 31697 23162 15011 11280
## [21] 10188 9057 7924 8908 12453 20382 11276 13154 8151 7994
## [31] 8842 10802 21854 9472 10665 0 9059 45 45 14041
## [41] 7926 10008 12592 9059 0 0 7994 16981 7926 13520
## [51] 17891 17891 19386 8266 22624 7926 11323 10395 12113 11320
## [61] 11867 25811 7992 15143 7924 11323 9001 10007 7893 26900
## [71] 99994 16256 7924 8447 8536 10007 14856 13588 11210 9059
## [81] 9059 7976 8264 19994 10007 8490 31682 66379 9172 13673
## [91] 8989 7926 18113 7992 8832 29999 8991 7926 9001 13996
## [101] 9059 10191 11320 9918 31946 10352 10191 7926 9917 12453
## [111] 9901 12376 10259 11323 226409 9782 13673 13585 29624 67940
## [121] 11321 15952 20379 8490 13171 13673 13157 49996 23 22641
## [131] 26730 7924 7893 34505 13588 7926 8198 11751 45 16098
## [141] 15008 9991 46 99996 35026 13588 13585 56617 45 18815
## [151] 9011 0 0 51068 0 12001 7926 19405 15849 8991
## [161] 13673 9115 15850 24022 33554 33954 33554 33676 33676 33761
## [171] 33676 33676 33676 33557 33557 33676 33676 33554 33954 33554
## [181] 33554 33554 33554 33676 33554 33954 33954 33557 33557 33676
## [191] 33676 33557 33676 33557 33676 33676 33554 33554 33676 33557
## [201] 33557 33557 33557 33554 33554 33954 33554 33954 33557 33954
## [211] 33527 33954 33557 33527 33527 33557 33557 33761 33554 33676
## [221] 33676 33557 33557 33954 33557 33557 33954 33554 33554 33554
## [231] 33954 33954 33557 33954 33954 33557 33557 33954 33557 33557
## [241] 33954 33954 33557 33557 33557 33557 33954 33557 33954 33554
## [251] 33954 33557 33557 33954 33527 33527 33527 33527 33954 33557
## [261] 33557 33554 33954 33557 33954 33954 33557 33557 33557 33557
## [271] 33557 33954 33554 33554 33954 33554 33554 33557 33557 33557
## [281] 33557 33954 33954 33554 33554 33676 33676 33761 33676 33676
## [291] 33676 33554 33954 33557 33557 33527 33527 33527 33527 33676
## [301] 33676 33557 33954 33527 33527 33761 33554 33761 33954 33954
## [311] 33557 33554 33554 33554 33554 33557 33557 33554 33554 33676
## [321] 33676 8309 9489 7994 7994
# exclude log(price outliers)
listings7 <- listings6 %>%
filter(!log(price) %in% boxplot.stats(log(price))$out)
# Check new distribution
listings7 %>%
ggplot() +
geom_boxplot(aes(x = reorder(prop_type_simplified, price, FUN = median), y=price)) +
labs(subtitle = "Price Distribution by Property Type",
title = "Price outliers within reasonable range",
y = "Price per night",
x = "Property type") +
theme(axis.title.x = element_blank())+
theme_economist() +
# Add dollar signs
scale_y_continuous(labels = scales::dollar_format())
# Check new distribution
listings7 %>%
ggplot() +
geom_density(aes(x=price)) +
# Facet by property type
facet_wrap(~prop_type_simplified, scale = "free_x") +
labs(subtitle = "Price Distribution by Property Type",
title = "Price outliers within reasonable range",
x = "Price per night (USD)") +
theme(axis.title.x = element_blank())+
theme_economist()
Our new distribution of price is more visually comprehensible, however when looking at density plots we still find a heavy right-skewness, which should be taken into account going forwards, especially as AirBnB’s maximum price filter is USD1000+
# CLook at distribution to ratings
listings7 %>%
# Change facet name of superhost (TRUE = superhost, FALSE, = not superhost)
mutate(host_is_superhost = if_else(host_is_superhost, "Superhost", "Not Superhost")) %>%
# Plot histogram
ggplot() +
geom_histogram( aes(x = review_scores_rating), binwidth = 1) +
labs(subtitle = "Irregularities at ratings of 20, 40, 60, and 80 may be due to
new listings with insufficient data, or AirBnB rounding algorithms for lower values",
title = "Ratings are left skewed - more so for superhosts") +
facet_wrap(~host_is_superhost) +
theme_economist()
# Investigate relation of Rating to Price
listings7 %>%
ggplot() +
geom_point( aes(x = review_scores_rating, y = price)) +
geom_smooth(aes(x = review_scores_rating, y = price), method='lm') +
labs(subtitle = "Irregularities at ratings of 20,and 40 may be due to
new listings with insufficient data, or AirBnB rounding algorithms for lower values",
title = "Well rated properties seem to charge more on average \n - but driven by hugely expensive properties") +
theme_economist()
Looking at review scores we saw that the ratings are left heavily skewed. This could likely be explained by the reciprocal rating mechanism of AirBnB, where hosts also rate the tenants and the tenants’ score impacts their likelihood of them being accepted to a new booking. Going forwards it could be considered to disregard low rated properties (e.g. <5) as these will have had substantial problems and might be outliers. A normal traveller wouldn’t consider properties with such low ratings.
That being said we saw a slight over-performance of superhosts compared to normal hosts. We also saw a tendency of better-rated properties to charge a higher price. However this seems to be especially driven by expensive properties.
# Look at distribution of cleaning fees
listings7 %>%
ggplot() +
geom_histogram( aes(x = cleaning_fee)) +
labs(subtitle = "These may be for more expensive properties",
title = "Some extraordinary cleaning fees are charged",
x = "Cleaning fees") +
theme_economist() +
# Add dollar sign
scale_x_continuous(labels = scales::dollar_format())
# Relate Cleaning fee to Price of minimum stay
listings7 %>%
filter(cleaning_fee >= 0) %>%
mutate(clean_to_price_min = cleaning_fee / (price * minimum_nights)) %>%
ggplot() +
geom_boxplot( aes(y = clean_to_price_min)) +
labs(title = "Cleaning Fee to Price Ratio",
subtitle = "In relation to price some cleaning fees are non-sensical and should be excluded") +
theme_economist()
# exclude clean to price outliers
listings8 <- listings7 %>%
mutate(clean_to_price_min = cleaning_fee / (price * minimum_nights)) %>%
filter(!clean_to_price_min %in% boxplot.stats(clean_to_price_min)$out)
# Find outliers that are being excluded
min(exp(boxplot.stats(listings7$cleaning_fee / (listings7$price * listings7$minimum_nights))$out))
## [1] 1.97
# Investigate relation of Rating to Price
listings8 %>%
ggplot() +
geom_point( aes(x = cleaning_fee, y = price)) +
geom_smooth(aes(x = cleaning_fee, y = price), method='lm') +
labs(subtitle = "Cleaning Fee vs Price",
title = "If cleaning fees are charged they increase in relation to price") +
theme_economist()
We initially looked at the distribution of cleaning fees and found some extraordinarily high fees being charged on some properties. Realising that a mansion costing USD1000+ per night might also be very expensive to clean, we decided to look at a different proxy for the logic behind our cleaning fee numbers - the cleaning fee to price per minimum stay ratio.
Plotting the boxplot of this ratio we still find many outliers that charge more than 2x the minimum stay price as a cleaning fee, which is not reasonable for a short-term holiday stay. This pricing strategy might be used to trick tenants to look at more expensive properties by making the price per night price artificially low. Whereas this might be legitimate it will significantly skew our findings as we can no longer understand whether part of a cleaning fee is actually the price per night. We therefore excluded all outliers of over 1.97x the cleaning fee to minimum stay price ratio.
## Create histogram to understand distribution of house size
listings8 %>%
ggplot()+
geom_histogram(aes(x= accommodates), binwidth = 1) +
labs(subtitle = "Accommodates (House Size) Histogram",
title = "Most properties accommodate less than 5 guests") +
theme_economist()
## Create scatterplot of house size and price
listings8 %>%
ggplot()+
geom_point(aes(x= accommodates, y = price)) +
# add smoothing line
geom_smooth(aes(x= accommodates, y = price), method = "lm") +
# add titles
labs(subtitle = "House Size vs Price",
title = "Propoerties that can accommodate more guests \ntend to be more expensive",
y = "Daily prices (USD)") +
theme_economist()
Looking at the accommodation variable we can see that the majority of properties in Mexico City accommodate less than 5 guests. We also find that accommodation, as a proxy for house size, is positively correlated to price.
# Look at distribution to ratings
listings8 %>%
summarise(log(price), review_scores_rating, accommodates, security_deposit, host_is_superhost) %>%
ggpairs(aes(colour = host_is_superhost, alpha = 0.4),
title = "Light blue represents superhosts, red represents normal hosts")
# ggsave to resize image
ggsave("combined_graphs.jpg", plot = last_plot(), width = 34, height = 18, units ="cm", dpi = 100)
knitr::include_graphics(here::here("content/projects/project4", "combined_graphs.jpg"), error = FALSE)
The graphic above shows us a lot about our data.
The intuitive conclusions:
The less expected information:
# create colour palette for property type
factpal <- colorFactor(topo.colors(5), listings8$property_type)
# create colour palette for price
percentile_price <- listings8 %>%
summarise(percentile_price = (price - min(price))/ (max(price) - min(price)))
# add dollar signs for price
listings8 <- listings8 %>%
mutate(content = paste(property_type, "for $",(price), sep = " "))
# Create map
leaflet(listings8) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
color = ~factpal(property_type),
popup = ~listing_url,
label = ~content)
For the target variable , we will use the cost for two people to stay at an Airbnb location for four (4) nights.
First, we check the data to see if there is any property that will only rent for less than 4 nights (maximum_nights < 4). This may be unrealistic, but it is possible that some hosts have a preference for short stays only, or that these listings are fake. Either way, if any property has a maximum nights of below 4, they need to be excluded.
count_maximum_nights <- listings8 %>%
# Count number of properties for each level of minimum nights
group_by(maximum_nights) %>%
summarise(count = n()) %>%
# Arrange in descending order
arrange(maximum_nights)
# Show list
count_maximum_nights
| maximum_nights | count |
|---|---|
| 1 | 21 |
| 2 | 31 |
| 3 | 56 |
| 4 | 48 |
| 5 | 95 |
| 6 | 49 |
| 7 | 254 |
| 8 | 61 |
| 9 | 7 |
| 10 | 169 |
| 11 | 9 |
| 12 | 27 |
| 13 | 2 |
| 14 | 79 |
| 15 | 328 |
| 16 | 13 |
| 17 | 9 |
| 18 | 8 |
| 19 | 2 |
| 20 | 116 |
| 21 | 36 |
| 22 | 4 |
| 23 | 4 |
| 24 | 6 |
| 25 | 25 |
| 26 | 2 |
| 27 | 8 |
| 28 | 174 |
| 29 | 47 |
| 30 | 1268 |
| 31 | 241 |
| 32 | 38 |
| 33 | 6 |
| 34 | 2 |
| 35 | 46 |
| 36 | 9 |
| 37 | 2 |
| 38 | 4 |
| 39 | 1 |
| 40 | 48 |
| 41 | 2 |
| 42 | 3 |
| 43 | 1 |
| 44 | 1 |
| 45 | 79 |
| 46 | 3 |
| 48 | 1 |
| 49 | 2 |
| 50 | 30 |
| 51 | 5 |
| 53 | 1 |
| 55 | 4 |
| 56 | 1 |
| 57 | 1 |
| 59 | 2 |
| 60 | 463 |
| 61 | 6 |
| 62 | 20 |
| 63 | 2 |
| 64 | 2 |
| 65 | 15 |
| 66 | 2 |
| 67 | 1 |
| 69 | 1 |
| 70 | 6 |
| 71 | 1 |
| 75 | 4 |
| 80 | 10 |
| 81 | 1 |
| 84 | 2 |
| 89 | 3 |
| 90 | 540 |
| 91 | 9 |
| 92 | 6 |
| 93 | 59 |
| 94 | 1 |
| 95 | 8 |
| 98 | 1 |
| 99 | 7 |
| 100 | 114 |
| 101 | 2 |
| 102 | 2 |
| 104 | 1 |
| 107 | 1 |
| 110 | 1 |
| 111 | 1 |
| 113 | 1 |
| 119 | 2 |
| 120 | 116 |
| 124 | 3 |
| 125 | 2 |
| 130 | 5 |
| 135 | 3 |
| 140 | 2 |
| 150 | 29 |
| 152 | 1 |
| 154 | 1 |
| 160 | 3 |
| 168 | 1 |
| 169 | 1 |
| 175 | 1 |
| 179 | 1 |
| 180 | 208 |
| 182 | 15 |
| 185 | 35 |
| 186 | 5 |
| 187 | 3 |
| 188 | 1 |
| 189 | 2 |
| 190 | 5 |
| 200 | 32 |
| 208 | 1 |
| 210 | 10 |
| 221 | 1 |
| 224 | 1 |
| 238 | 1 |
| 240 | 9 |
| 250 | 4 |
| 260 | 3 |
| 269 | 1 |
| 270 | 3 |
| 279 | 1 |
| 300 | 53 |
| 301 | 1 |
| 320 | 2 |
| 322 | 3 |
| 325 | 1 |
| 330 | 1 |
| 333 | 1 |
| 340 | 2 |
| 350 | 8 |
| 355 | 1 |
| 358 | 4 |
| 359 | 1 |
| 360 | 89 |
| 361 | 1 |
| 362 | 1 |
| 363 | 1 |
| 364 | 14 |
| 365 | 522 |
| 366 | 10 |
| 369 | 2 |
| 370 | 2 |
| 375 | 1 |
| 376 | 1 |
| 380 | 1 |
| 383 | 1 |
| 390 | 1 |
| 398 | 2 |
| 400 | 16 |
| 444 | 1 |
| 450 | 1 |
| 470 | 1 |
| 500 | 21 |
| 600 | 10 |
| 700 | 7 |
| 730 | 11 |
| 800 | 1 |
| 900 | 3 |
| 902 | 1 |
| 999 | 25 |
| 1e+03 | 50 |
| 1e+03 | 1 |
| 1e+03 | 1 |
| 1.08e+03 | 1 |
| 1.1e+03 | 1 |
| 1.1e+03 | 1 |
| 1.1e+03 | 3 |
| 1.11e+03 | 4 |
| 1.12e+03 | 2 |
| 1.12e+03 | 4 |
| 1.12e+03 | 2 |
| 1.12e+03 | 1 |
| 1.12e+03 | 3 |
| 1.12e+03 | 60 |
| 1.12e+03 | 8498 |
| 1.13e+03 | 1 |
| 1.13e+03 | 1 |
| 2e+03 | 1 |
| 3e+03 | 1 |
| 5e+03 | 2 |
| 1e+04 | 2 |
| 5e+04 | 2 |
We can see from the table above that there are over 100 accommodations with the maximum nights below 4. These will be excluded.
Next, we create a new variable called price_4_nights that uses price, cleaning_fee, accommodates, guests_included, and extra_people to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is will be our dependent variable.
Note: we first worked with the definition that extra_people is the charge for having more than 1 person. However, after further research, we have realised that this is INCORRECT. Extra_people fee is, in fact, the per person charge for every additional guest exceeding those included in the original prices, as long as the number of guests is not higher than the maximum number that the property accommodates. We will perform our analysis using this new definition.
listings9 <- listings8 %>%
# Filter out those with maximum nights below 4 or accommodates less than 2 as we cannot rent there
filter(maximum_nights >= 4,
accommodates >= 2) %>%
# Create new variable
mutate(price_4_nights = if_else(guests_included >= 2,
price * 4 + cleaning_fee, # Price if original rates already include 2 people or above
price * 4 + cleaning_fee + extra_people)) # Price if original rates only include 1 person
Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should we use for the regression model? Why?
# Create simple density plot
listings9 %>%
ggplot() +
geom_density( aes(x = price_4_nights), binwidth = 100) +
labs(title = "Density plot of prices for 4 nights heavily right-skewed",
subtitle = "There is a strict zero lower bound to prices, but no strict upper bound",
x = "Total price of accommodation for 4 nights (USD)") +
theme_economist()
# Create logged density plot
listings9 %>%
ggplot() +
geom_density( aes(x = log(price_4_nights) ), binwidth = 100) +
labs(subtitle = "Log 4-nights prices is closer to normal distribution",
title = "No obvious skewed patterns",
x = "Log of total price of accommodation for 4 nights") +
theme_economist()
It looks like we should use the log-transformed price_4_nights variable as it is significantly less skewed and closer to normal distribution.
Our first regression model called model1 will start with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
# Create regression
model1 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = listings9)
summary(model1)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating, data = listings9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7723 -0.5000 -0.0178 0.4460 2.5209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.828347 0.089266 87.70 < 2e-16 ***
## prop_type_simplifiedCondominium 0.015047 0.025168 0.60 0.54995
## prop_type_simplifiedHouse -0.414243 0.020150 -20.56 < 2e-16 ***
## prop_type_simplifiedLoft 0.002488 0.026191 0.10 0.92431
## prop_type_simplifiedOther -0.110432 0.023618 -4.68 3e-06 ***
## number_of_reviews 0.000467 0.000133 3.51 0.00044 ***
## review_scores_rating 0.004775 0.000932 5.12 3e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.694 on 11335 degrees of freedom
## (1572 observations deleted due to missingness)
## Multiple R-squared: 0.0428, Adjusted R-squared: 0.0423
## F-statistic: 84.6 on 6 and 11335 DF, p-value: <2e-16
autoplot(model1) + theme_bw()
# Check for collinearity
vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.03 4 1.00
## number_of_reviews 1.03 1 1.01
## review_scores_rating 1.01 1 1.00
# Summarise model estimates in dataframe
model1_estimates <- summary(model1)$coefficients[,1]
model1_estimates <- exp(model1_estimates)
model1_estimates <- stack(model1_estimates)
model1_estimates <- model1_estimates %>%
select(ind, values)
model1_estimates
| ind | values |
|---|---|
| (Intercept) | 2.51e+03 |
| prop_type_simplifiedCondominium | 1.02 |
| prop_type_simplifiedHouse | 0.661 |
| prop_type_simplifiedLoft | 1 |
| prop_type_simplifiedOther | 0.895 |
| number_of_reviews | 1 |
| review_scores_rating | 1 |
We don’t find any high levels of collinearity in terms of VIF. We therefore don’t have to exclude any independent variables.
Interpreting the coefficient review_scores_rating in terms of price_4_nights.
The effect of review score ratings on 4-nights rents is of strong statistical significance (it is non-zero even at 99.99% significance level). Ceteris paribus, an increase in review score ratings of 1 is expected to induce a 0.48% increase in 4-nights prices.
Interpreting the coefficient of prop_type_simplified in terms of price_4_nights.
Looking at our property types coefficients and p-values, we can find that changing from renting Apartments to renting ‘House’ and ‘Other’ induce effects of strong statistical significance (at least at 99% level of significance) on 4-nights rents:
Ceteris paribus, changing from renting Apartments to renting House is expected to lead to a 41.4% decrease in 4-nights prices.
Ceteris paribus, changing from renting Apartments to renting ‘Other’ is expected to lead to a 11% decrease in 4-nights prices.
Meanwhile, the effect of renting Lofts or Condominium is not statistically different from that of renting Apartments.
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
# Create regression
model2 <- lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating + room_type,
data = listings9)
summary(model2)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews +
## review_scores_rating + room_type, data = listings9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.979 -0.376 -0.043 0.336 2.668
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.137277 0.073236 111.11 < 2e-16 ***
## prop_type_simplifiedCondominium 0.037686 0.020610 1.83 0.06750 .
## prop_type_simplifiedHouse 0.073227 0.017757 4.12 3.8e-05 ***
## prop_type_simplifiedLoft -0.142949 0.021538 -6.64 3.3e-11 ***
## prop_type_simplifiedOther 0.089832 0.021120 4.25 2.1e-05 ***
## number_of_reviews -0.000378 0.000109 -3.45 0.00056 ***
## review_scores_rating 0.004462 0.000763 5.85 5.2e-09 ***
## room_typeHotel room -0.092666 0.051673 -1.79 0.07295 .
## room_typePrivate room -0.918969 0.012519 -73.41 < 2e-16 ***
## room_typeShared room -1.214563 0.066764 -18.19 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.568 on 11332 degrees of freedom
## (1572 observations deleted due to missingness)
## Multiple R-squared: 0.359, Adjusted R-squared: 0.358
## F-statistic: 704 on 9 and 11332 DF, p-value: <2e-16
autoplot(model2) + theme_bw()
# Check for collinearity
vif(model2)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.43 4 1.05
## number_of_reviews 1.04 1 1.02
## review_scores_rating 1.01 1 1.00
## room_type 1.43 3 1.06
# Summarise model estimates in dataframe
model2_estimates <- summary(model2)$coefficients[,1]
model2_estimates <- exp(model2_estimates)
model2_estimates <- stack(model2_estimates)
model2_estimates <- model2_estimates %>%
select(ind, values)
model2_estimates
| ind | values |
|---|---|
| (Intercept) | 3.42e+03 |
| prop_type_simplifiedCondominium | 1.04 |
| prop_type_simplifiedHouse | 1.08 |
| prop_type_simplifiedLoft | 0.867 |
| prop_type_simplifiedOther | 1.09 |
| number_of_reviews | 1 |
| review_scores_rating | 1 |
| room_typeHotel room | 0.911 |
| room_typePrivate room | 0.399 |
| room_typeShared room | 0.297 |
We can see that room type is a statistically significant determinant of 4-night prices:
Moving from renting entire house/apartment to renting private rooms is expected to induce a price decrease of 91.9%
Moving from renting entire house/apartment to renting shared rooms is expected to induce a price decrease of 121.5%. However, this is an unrealistic assumption, as it means that we can expect to receive money if we were to rent shared rooms. Hence, we can only interpret from the data that renting a shared room will induce a percentage fall in 4-night prices even greater than that of renting private rooms.
Our dataset has many more variables, so here we are going to extend our analysis to further explore the data.
As a standalone model
# Create regression
model3 <- lm(log(price_4_nights) ~ beds + bathrooms + accommodates,
data = listings9)
summary(model3)
##
## Call:
## lm(formula = log(price_4_nights) ~ beds + bathrooms + accommodates,
## data = listings9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.947 -0.426 0.000 0.427 2.223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.64672 0.01201 636.6 <2e-16 ***
## beds -0.07745 0.00522 -14.8 <2e-16 ***
## bathrooms 0.10910 0.00700 15.6 <2e-16 ***
## accommodates 0.16901 0.00396 42.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.638 on 12811 degrees of freedom
## (99 observations deleted due to missingness)
## Multiple R-squared: 0.209, Adjusted R-squared: 0.209
## F-statistic: 1.13e+03 on 3 and 12811 DF, p-value: <2e-16
autoplot(model3) + theme_bw()
# Check for collinearity
vif(model3)
## beds bathrooms accommodates
## 2.73 1.55 2.28
# Summarise model estimates in dataframe
model3_estimates <- summary(model3)$coefficients[,1]
model3_estimates <- exp(model3_estimates)
model3_estimates <- stack(model3_estimates)
model3_estimates <- model3_estimates %>%
select(ind, values)
model3_estimates
| ind | values |
|---|---|
| (Intercept) | 2.09e+03 |
| beds | 0.925 |
| bathrooms | 1.12 |
| accommodates | 1.18 |
The number of beds, bathrooms, and the maximum number of people accommodated by a property have statistically significant effects on prices. From the data, adding 1 bathroom is expected to increase 4-night prices by 10.9% and expanding the maximum number of people accommodated by 1 is expected to increase 4-night prices by 16.9%. However, from the data it seems that by adding 1 additional bed, we would expect 4-night prices to fall by 7.6%. This seems illogical. However, it may be because houses that register more beds are those with many temporary, fold-able beds or sofa beds. This type of setting may make the property looks messy, so it only suits cheaper budget accommodations where the presentation of the property is of lesser importance.
As part of extended model
# Create regression
model4 <- lm(log(price_4_nights) ~ beds + bathrooms + accommodates + prop_type_simplified + number_of_reviews + review_scores_rating + room_type,
data = listings9)
summary(model4)
##
## Call:
## lm(formula = log(price_4_nights) ~ beds + bathrooms + accommodates +
## prop_type_simplified + number_of_reviews + review_scores_rating +
## room_type, data = listings9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.954 -0.347 -0.026 0.318 2.582
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.589522 0.070477 107.69 < 2e-16 ***
## beds -0.059573 0.004866 -12.24 < 2e-16 ***
## bathrooms 0.128271 0.006729 19.06 < 2e-16 ***
## accommodates 0.103888 0.004034 25.75 < 2e-16 ***
## prop_type_simplifiedCondominium 0.016324 0.019315 0.85 0.3980
## prop_type_simplifiedHouse -0.016402 0.016813 -0.98 0.3293
## prop_type_simplifiedLoft -0.017101 0.020425 -0.84 0.4025
## prop_type_simplifiedOther 0.084507 0.019874 4.25 2.1e-05 ***
## number_of_reviews -0.000284 0.000103 -2.77 0.0057 **
## review_scores_rating 0.004974 0.000720 6.91 5.1e-12 ***
## room_typeHotel room 0.000705 0.048484 0.01 0.9884
## room_typePrivate room -0.724805 0.012969 -55.89 < 2e-16 ***
## room_typeShared room -1.219237 0.063750 -19.13 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.531 on 11280 degrees of freedom
## (1621 observations deleted due to missingness)
## Multiple R-squared: 0.44, Adjusted R-squared: 0.439
## F-statistic: 738 on 12 and 11280 DF, p-value: <2e-16
autoplot(model4) + theme_bw()
# Check for collinearity
vif(model4)
## GVIF Df GVIF^(1/(2*Df))
## beds 2.88 1 1.70
## bathrooms 1.54 1 1.24
## accommodates 2.83 1 1.68
## prop_type_simplified 1.52 4 1.05
## number_of_reviews 1.04 1 1.02
## review_scores_rating 1.01 1 1.00
## room_type 1.84 3 1.11
# Summarise model estimates in dataframe
model4_estimates <- summary(model4)$coefficients[,1]
model4_estimates <- exp(model4_estimates)
model4_estimates <- stack(model4_estimates)
model4_estimates <- model4_estimates %>%
select(ind, values)
model4_estimates
| ind | values |
|---|---|
| (Intercept) | 1.98e+03 |
| beds | 0.942 |
| bathrooms | 1.14 |
| accommodates | 1.11 |
| prop_type_simplifiedCondominium | 1.02 |
| prop_type_simplifiedHouse | 0.984 |
| prop_type_simplifiedLoft | 0.983 |
| prop_type_simplifiedOther | 1.09 |
| number_of_reviews | 1 |
| review_scores_rating | 1 |
| room_typeHotel room | 1 |
| room_typePrivate room | 0.484 |
| room_typeShared room | 0.295 |
Again, the number of bathrooms and the maximum number of people accommodated by the listing are positively correlated to 4-nights prices, while beds quantity is negatively correlated. However, what is more interesting here is that adding these 3 variables render the effect of most property types statistically insignificant.
# Create regression
model6 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost + is_location_exact,
data = listings9)
summary(model6)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds +
## bathrooms + accommodates + number_of_reviews + review_scores_rating +
## room_type + host_is_superhost + is_location_exact, data = listings9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.911 -0.346 -0.025 0.312 2.610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.645062 0.072699 105.16 < 2e-16 ***
## prop_type_simplifiedCondominium 0.016723 0.019215 0.87 0.38
## prop_type_simplifiedHouse -0.016839 0.016743 -1.01 0.31
## prop_type_simplifiedLoft -0.012194 0.020336 -0.60 0.55
## prop_type_simplifiedOther 0.089465 0.019778 4.52 6.1e-06 ***
## beds -0.058622 0.004842 -12.11 < 2e-16 ***
## bathrooms 0.127357 0.006695 19.02 < 2e-16 ***
## accommodates 0.103462 0.004014 25.78 < 2e-16 ***
## number_of_reviews -0.000438 0.000104 -4.22 2.5e-05 ***
## review_scores_rating 0.003091 0.000747 4.14 3.5e-05 ***
## room_typeHotel room 0.004840 0.048241 0.10 0.92
## room_typePrivate room -0.715511 0.012937 -55.31 < 2e-16 ***
## room_typeShared room -1.217241 0.063434 -19.19 < 2e-16 ***
## host_is_superhostTRUE 0.090513 0.010667 8.49 < 2e-16 ***
## is_location_exactTRUE 0.095795 0.014306 6.70 2.2e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.528 on 11278 degrees of freedom
## (1621 observations deleted due to missingness)
## Multiple R-squared: 0.446, Adjusted R-squared: 0.445
## F-statistic: 647 on 14 and 11278 DF, p-value: <2e-16
autoplot(model6) + theme_bw()
# Check for collinearity
vif(model6)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.52 4 1.05
## beds 2.88 1 1.70
## bathrooms 1.54 1 1.24
## accommodates 2.83 1 1.68
## number_of_reviews 1.08 1 1.04
## review_scores_rating 1.10 1 1.05
## room_type 1.85 3 1.11
## host_is_superhost 1.15 1 1.07
## is_location_exact 1.00 1 1.00
# Summarise model estimates in dataframe
model6_estimates <- summary(model6)$coefficients[,1]
model6_estimates <- exp(model6_estimates)
model6_estimates <- stack(model6_estimates)
model6_estimates <- model6_estimates %>%
select(ind, values)
model6_estimates
| ind | values |
|---|---|
| (Intercept) | 2.09e+03 |
| prop_type_simplifiedCondominium | 1.02 |
| prop_type_simplifiedHouse | 0.983 |
| prop_type_simplifiedLoft | 0.988 |
| prop_type_simplifiedOther | 1.09 |
| beds | 0.943 |
| bathrooms | 1.14 |
| accommodates | 1.11 |
| number_of_reviews | 1 |
| review_scores_rating | 1 |
| room_typeHotel room | 1 |
| room_typePrivate room | 0.489 |
| room_typeShared room | 0.296 |
| host_is_superhostTRUE | 1.09 |
| is_location_exactTRUE | 1.1 |
The effect of specifying the property’s exact location on prices is statistically significant: Ceteris paribus, staying at a place with exact location specified is associated with an increase in 4-nights rents of 9.6%%.
After controlling for other variables, is a listing’s exact location a significant predictor of price_4_nights?
listings10 <- listings9 %>%
# Create a new variables to regroup neighbourhoods
mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Cuauhtémoc",
"Coyoacán",
"Miguel Hidalgo",
"Benito Juárez",
"Tlalpan","
Venustiano Carranza") ~ "Touristy",
neighbourhood_cleansed %in% c("Iztacalco",
"Iztapalapa",
"Tláhuac",
"Milpa Alta") ~ "Least Popular",
TRUE ~ "Less Popular"))
As we could not break the boroughs down by geographical or administrative divisions, we decided to separate neighbourhoods by attractiveness to tourists: touristy areas, less popular areas, and unpopular areas (rough areas or those far from centre).
# Create regression
model7 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost + is_location_exact + neighbourhood_simplified,
data = listings10)
summary(model7)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds +
## bathrooms + accommodates + number_of_reviews + review_scores_rating +
## room_type + host_is_superhost + is_location_exact + neighbourhood_simplified,
## data = listings10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.783 -0.337 -0.029 0.308 2.530
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.092041 0.076339 92.90 < 2e-16 ***
## prop_type_simplifiedCondominium 0.022441 0.018758 1.20 0.23
## prop_type_simplifiedHouse 0.024890 0.016470 1.51 0.13
## prop_type_simplifiedLoft -0.014752 0.019847 -0.74 0.46
## prop_type_simplifiedOther 0.115113 0.019357 5.95 2.8e-09 ***
## beds -0.054244 0.004729 -11.47 < 2e-16 ***
## bathrooms 0.119558 0.006542 18.28 < 2e-16 ***
## accommodates 0.103445 0.003917 26.41 < 2e-16 ***
## number_of_reviews -0.000600 0.000102 -5.90 3.8e-09 ***
## review_scores_rating 0.003249 0.000729 4.46 8.4e-06 ***
## room_typeHotel room -0.025764 0.047095 -0.55 0.58
## room_typePrivate room -0.710377 0.012627 -56.26 < 2e-16 ***
## room_typeShared room -1.197298 0.061926 -19.33 < 2e-16 ***
## host_is_superhostTRUE 0.081382 0.010416 7.81 6.1e-15 ***
## is_location_exactTRUE 0.075717 0.013999 5.41 6.5e-08 ***
## neighbourhood_simplifiedLess Popular 0.383522 0.032268 11.89 < 2e-16 ***
## neighbourhood_simplifiedTouristy 0.600633 0.029453 20.39 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.516 on 11276 degrees of freedom
## (1621 observations deleted due to missingness)
## Multiple R-squared: 0.472, Adjusted R-squared: 0.471
## F-statistic: 630 on 16 and 11276 DF, p-value: <2e-16
autoplot(model7) + theme_bw()
# Check for collinearity
vif(model7)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.55 4 1.06
## beds 2.89 1 1.70
## bathrooms 1.54 1 1.24
## accommodates 2.83 1 1.68
## number_of_reviews 1.09 1 1.04
## review_scores_rating 1.10 1 1.05
## room_type 1.85 3 1.11
## host_is_superhost 1.15 1 1.07
## is_location_exact 1.01 1 1.00
## neighbourhood_simplified 1.05 2 1.01
# Summarise model estimates in dataframe
model7_estimates <- summary(model7)$coefficients[,1]
model7_estimates <- exp(model7_estimates)
model7_estimates <- stack(model7_estimates)
model7_estimates <- model7_estimates %>%
select(ind, values)
model7_estimates
| ind | values |
|---|---|
| (Intercept) | 1.2e+03 |
| prop_type_simplifiedCondominium | 1.02 |
| prop_type_simplifiedHouse | 1.03 |
| prop_type_simplifiedLoft | 0.985 |
| prop_type_simplifiedOther | 1.12 |
| beds | 0.947 |
| bathrooms | 1.13 |
| accommodates | 1.11 |
| number_of_reviews | 0.999 |
| review_scores_rating | 1 |
| room_typeHotel room | 0.975 |
| room_typePrivate room | 0.491 |
| room_typeShared room | 0.302 |
| host_is_superhostTRUE | 1.08 |
| is_location_exactTRUE | 1.08 |
| neighbourhood_simplifiedLess Popular | 1.47 |
| neighbourhood_simplifiedTouristy | 1.82 |
As expected, the type of neighbourhood is highly correlated with rents. Ceteris paribus, comparing to staying at Least Popular areas, staying at Touristy and Less Popular places is associated with an increase in 4-nights rents of 60.1% and 38.4% respectively.
model8 <- lm(log(price_4_nights) ~ prop_type_simplified + beds + bathrooms + accommodates + number_of_reviews + review_scores_rating + room_type + host_is_superhost + is_location_exact + neighbourhood_simplified + cancellation_policy,
data = listings10)
summary(model8)
##
## Call:
## lm(formula = log(price_4_nights) ~ prop_type_simplified + beds +
## bathrooms + accommodates + number_of_reviews + review_scores_rating +
## room_type + host_is_superhost + is_location_exact + neighbourhood_simplified +
## cancellation_policy, data = listings10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.735 -0.335 -0.026 0.306 2.519
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 7.059924 0.076081 92.80
## prop_type_simplifiedCondominium 0.027316 0.018673 1.46
## prop_type_simplifiedHouse 0.027065 0.016390 1.65
## prop_type_simplifiedLoft -0.012802 0.019754 -0.65
## prop_type_simplifiedOther 0.123647 0.019281 6.41
## beds -0.052164 0.004710 -11.08
## bathrooms 0.117414 0.006513 18.03
## accommodates 0.099958 0.003912 25.55
## number_of_reviews -0.000688 0.000102 -6.76
## review_scores_rating 0.003367 0.000726 4.64
## room_typeHotel room -0.027584 0.046994 -0.59
## room_typePrivate room -0.702306 0.012612 -55.69
## room_typeShared room -1.186791 0.061657 -19.25
## host_is_superhostTRUE 0.073386 0.010419 7.04
## is_location_exactTRUE 0.077462 0.013935 5.56
## neighbourhood_simplifiedLess Popular 0.378474 0.032116 11.78
## neighbourhood_simplifiedTouristy 0.590477 0.029340 20.13
## cancellation_policymoderate 0.040541 0.011391 3.56
## cancellation_policystrict_14_with_grace_period 0.114395 0.012701 9.01
## cancellation_policysuper_strict_30 0.951945 0.148801 6.40
## cancellation_policysuper_strict_60 0.051269 0.515010 0.10
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## prop_type_simplifiedCondominium 0.14354
## prop_type_simplifiedHouse 0.09870 .
## prop_type_simplifiedLoft 0.51695
## prop_type_simplifiedOther 1.5e-10 ***
## beds < 2e-16 ***
## bathrooms < 2e-16 ***
## accommodates < 2e-16 ***
## number_of_reviews 1.5e-11 ***
## review_scores_rating 3.5e-06 ***
## room_typeHotel room 0.55724
## room_typePrivate room < 2e-16 ***
## room_typeShared room < 2e-16 ***
## host_is_superhostTRUE 2.0e-12 ***
## is_location_exactTRUE 2.8e-08 ***
## neighbourhood_simplifiedLess Popular < 2e-16 ***
## neighbourhood_simplifiedTouristy < 2e-16 ***
## cancellation_policymoderate 0.00037 ***
## cancellation_policystrict_14_with_grace_period < 2e-16 ***
## cancellation_policysuper_strict_30 1.6e-10 ***
## cancellation_policysuper_strict_60 0.92070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.513 on 11272 degrees of freedom
## (1621 observations deleted due to missingness)
## Multiple R-squared: 0.478, Adjusted R-squared: 0.477
## F-statistic: 515 on 20 and 11272 DF, p-value: <2e-16
autoplot(model8) + theme_bw()
# Check for collinearity
vif(model8)
## GVIF Df GVIF^(1/(2*Df))
## prop_type_simplified 1.56 4 1.06
## beds 2.89 1 1.70
## bathrooms 1.55 1 1.24
## accommodates 2.85 1 1.69
## number_of_reviews 1.10 1 1.05
## review_scores_rating 1.10 1 1.05
## room_type 1.88 3 1.11
## host_is_superhost 1.16 1 1.08
## is_location_exact 1.01 1 1.01
## neighbourhood_simplified 1.05 2 1.01
## cancellation_policy 1.09 4 1.01
# Summarise model estimates in dataframe
model8_estimates <- summary(model8)$coefficients[,1]
model8_estimates <- exp(model8_estimates)
model8_estimates <- stack(model8_estimates)
model8_estimates <- model8_estimates %>%
select(ind, values)
model8_estimates
| ind | values |
|---|---|
| (Intercept) | 1.16e+03 |
| prop_type_simplifiedCondominium | 1.03 |
| prop_type_simplifiedHouse | 1.03 |
| prop_type_simplifiedLoft | 0.987 |
| prop_type_simplifiedOther | 1.13 |
| beds | 0.949 |
| bathrooms | 1.12 |
| accommodates | 1.11 |
| number_of_reviews | 0.999 |
| review_scores_rating | 1 |
| room_typeHotel room | 0.973 |
| room_typePrivate room | 0.495 |
| room_typeShared room | 0.305 |
| host_is_superhostTRUE | 1.08 |
| is_location_exactTRUE | 1.08 |
| neighbourhood_simplifiedLess Popular | 1.46 |
| neighbourhood_simplifiedTouristy | 1.8 |
| cancellation_policymoderate | 1.04 |
| cancellation_policystrict_14_with_grace_period | 1.12 |
| cancellation_policysuper_strict_30 | 2.59 |
| cancellation_policysuper_strict_60 | 1.05 |
The effects of cancellation policy on accommodation rents is statistically significant. Generally speaking, the stricter the cancellation policy, the higher the accommodation price. An exception is the cancellation policy “Super Strict 60 Days”, which is statistically indifferent from Flexible Cancellation policy. This may be because it is too extreme, and thus is only used by hosts of very small listings who need to maximise their earnings regardless of their customer service, or simply because there are not enough data points for this type of policy.
As for the other policies, moving from Flexible policy to Moderate, Strict (with grace period), and “Super Strict 30 Days” cancellation policy is expected to lead to a 4-nights price increase of 4.1%, 11.4%, and 95.1% respectively.
huxreg(
# Choose models to include and set their names
list("Property type and review stats" = model1,
"Room type added" = model2,
"Propery size added" = model4,
"Superhost added" = model5,
"Exact location availability added" = model6,
"Neighborhood added" = model7,
"Cancellation policy added" = model8),
# Choose what statistics to display at the bottom
statistics = c("#Observation" = "nobs",
"Adjusted R2" = "adj.r.squared",
"Residual SE" = "sigma"),
# Bold coefficients that are statistically significant at 95% significance level
bold_signif = 0.05) %>%
# Set table title
set_caption("Comparison of models analysing 4-night Airbnb prices in Mexico City")
| Property type and review stats | Room type added | Propery size added | Superhost added | Exact location availability added | Neighborhood added | Cancellation policy added | |
|---|---|---|---|---|---|---|---|
| (Intercept) | 7.828 *** | 8.137 *** | 7.590 *** | 7.722 *** | 7.645 *** | 7.092 *** | 7.060 *** |
| (0.089) | (0.073) | (0.070) | (0.072) | (0.073) | (0.076) | (0.076) | |
| prop_type_simplifiedCondominium | 0.015 | 0.038 | 0.016 | 0.017 | 0.017 | 0.022 | 0.027 |
| (0.025) | (0.021) | (0.019) | (0.019) | (0.019) | (0.019) | (0.019) | |
| prop_type_simplifiedHouse | -0.414 *** | 0.073 *** | -0.016 | -0.021 | -0.017 | 0.025 | 0.027 |
| (0.020) | (0.018) | (0.017) | (0.017) | (0.017) | (0.016) | (0.016) | |
| prop_type_simplifiedLoft | 0.002 | -0.143 *** | -0.017 | -0.011 | -0.012 | -0.015 | -0.013 |
| (0.026) | (0.022) | (0.020) | (0.020) | (0.020) | (0.020) | (0.020) | |
| prop_type_simplifiedOther | -0.110 *** | 0.090 *** | 0.085 *** | 0.088 *** | 0.089 *** | 0.115 *** | 0.124 *** |
| (0.024) | (0.021) | (0.020) | (0.020) | (0.020) | (0.019) | (0.019) | |
| number_of_reviews | 0.000 *** | -0.000 *** | -0.000 ** | -0.000 *** | -0.000 *** | -0.001 *** | -0.001 *** |
| (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | 0.005 *** | 0.004 *** | 0.005 *** | 0.003 *** | 0.003 *** | 0.003 *** | 0.003 *** |
| (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | (0.001) | |
| room_typeHotel room | -0.093 | 0.001 | 0.007 | 0.005 | -0.026 | -0.028 | |
| (0.052) | (0.048) | (0.048) | (0.048) | (0.047) | (0.047) | ||
| room_typePrivate room | -0.919 *** | -0.725 *** | -0.717 *** | -0.716 *** | -0.710 *** | -0.702 *** | |
| (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | (0.013) | ||
| room_typeShared room | -1.215 *** | -1.219 *** | -1.212 *** | -1.217 *** | -1.197 *** | -1.187 *** | |
| (0.067) | (0.064) | (0.064) | (0.063) | (0.062) | (0.062) | ||
| beds | -0.060 *** | -0.059 *** | -0.059 *** | -0.054 *** | -0.052 *** | ||
| (0.005) | (0.005) | (0.005) | (0.005) | (0.005) | |||
| bathrooms | 0.128 *** | 0.128 *** | 0.127 *** | 0.120 *** | 0.117 *** | ||
| (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | |||
| accommodates | 0.104 *** | 0.104 *** | 0.103 *** | 0.103 *** | 0.100 *** | ||
| (0.004) | (0.004) | (0.004) | (0.004) | (0.004) | |||
| host_is_superhostTRUE | 0.092 *** | 0.091 *** | 0.081 *** | 0.073 *** | |||
| (0.011) | (0.011) | (0.010) | (0.010) | ||||
| is_location_exactTRUE | 0.096 *** | 0.076 *** | 0.077 *** | ||||
| (0.014) | (0.014) | (0.014) | |||||
| neighbourhood_simplifiedLess Popular | 0.384 *** | 0.378 *** | |||||
| (0.032) | (0.032) | ||||||
| neighbourhood_simplifiedTouristy | 0.601 *** | 0.590 *** | |||||
| (0.029) | (0.029) | ||||||
| cancellation_policymoderate | 0.041 *** | ||||||
| (0.011) | |||||||
| cancellation_policystrict_14_with_grace_period | 0.114 *** | ||||||
| (0.013) | |||||||
| cancellation_policysuper_strict_30 | 0.952 *** | ||||||
| (0.149) | |||||||
| cancellation_policysuper_strict_60 | 0.051 | ||||||
| (0.515) | |||||||
| #Observation | 11342 | 11342 | 11293 | 11293 | 11293 | 11293 | 11293 |
| Adjusted R2 | 0.042 | 0.358 | 0.439 | 0.443 | 0.445 | 0.471 | 0.477 |
| Residual SE | 0.694 | 0.568 | 0.531 | 0.530 | 0.528 | 0.516 | 0.513 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||||||
Finally, we will use the best model we came up with for prediction. Suppose you are planning to visit the city over reading week, and you want to stay in an Airbnb. Let’s assume you want to find Airbnb’s that are apartment with a private room, have at least 10 reviews, and an average rating of at least 90. Then, we will use my best model to predict the total cost to stay at this Airbnb for 4 nights for you.
Summary:
Apartment Private Room at least 10 reviews average rating at least 90
Added assumption: 1 bed, flexible cancellation policy, 1 bathroom, accommodates 2, less popular neighbourhood, not superhost, location is exact
Model to use: model8
Rationale: it has the highest R-squared, and is most extensive
imarginary_stay <-
# Create new data frame with imaginary scenario
tibble(prop_type_simplified = "Apartment",
beds = 1,
bathrooms = 1,
accommodates = 2,
number_of_reviews = 10,
review_scores_rating = 90,
room_type = "Private room",
host_is_superhost = FALSE,
is_location_exact = TRUE,
neighbourhood_simplified = "Less Popular",
cancellation_policy = "flexible")
model_prediction <-
# Save prediction as data frame so it can be mutated later
data.frame(
# Make prediction using model8 and imaginary data set
predict(model8,
newdata = imarginary_stay,
# Set CI of prediction
interval = "prediction")) %>%
# convert from log to USD values
mutate(price = exp(fit),
CI_lower1 = exp(lwr),
CI_upper2 = exp(upr),
CI_lower = sprintf("%.2f", CI_lower1, na.rm = TRUE),
CI_upper = sprintf("%.2f", CI_upper2, na.rm = TRUE)) %>%
# Remove unwanted columns
select(price, CI_lower, CI_upper)
#Show result
model_prediction
| price | CI_lower | CI_upper |
|---|---|---|
| 1.6e+03 | 583.09 | 4365.10 |